FRUIT: Faithfully Reflecting Updated Information in Text
Textual knowledge bases such as Wikipedia require considerable effort to keep
up to date and consistent. While automated writing assistants could potentially
ease this burden, the problem of suggesting edits grounded in external
knowledge has been under-explored. In this paper, we introduce the novel
generation task of *faithfully reflecting updated information in text* (FRUIT)
where the goal is to update an existing article given new evidence. We release
the FRUIT-WIKI dataset, a collection of over 170K distantly supervised instances
produced from pairs of Wikipedia snapshots, along with our data generation
pipeline and a gold evaluation set of 914 instances whose edits are guaranteed
to be supported by the evidence. We provide benchmark results for popular
generation systems as well as for EDIT5 -- a T5-based approach tailored to
editing that we introduce, which establishes the state of the art. Our analysis shows that
developing models that can update articles faithfully requires new capabilities
for neural generation models, and opens doors to many new applications.
Comment: v2.0, NAACL 2022
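To make the task setup concrete, the sketch below shows one way a distantly supervised update instance might be represented and linearized for a T5-style editor. The field names and marker tokens are assumptions for illustration, not the paper's actual data format.

```python
# Hypothetical sketch of a FRUIT-style training instance and its
# linearization for a T5-style editor. Field names and the marker
# tokens [ARTICLE] / [EVIDENCE i] are illustrative assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class UpdateInstance:
    source_text: str      # article text from the earlier snapshot
    evidence: List[str]   # new sentences that support the edit
    target_text: str      # article text from the later snapshot

def to_seq2seq_input(inst: UpdateInstance) -> str:
    """Concatenate the stale article with its new evidence."""
    evidence_str = " ".join(
        f"[EVIDENCE {i}] {e}" for i, e in enumerate(inst.evidence)
    )
    return f"[ARTICLE] {inst.source_text} {evidence_str}"

inst = UpdateInstance(
    source_text="The bridge opened in 1932.",
    evidence=["The bridge was closed for repairs in 2020."],
    target_text="The bridge opened in 1932 and was closed for repairs in 2020.",
)
print(to_seq2seq_input(inst))  # the model is trained to emit inst.target_text
```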
BUMP: A Benchmark of Unfaithful Minimal Pairs for Meta-Evaluation of Faithfulness Metrics
The proliferation of automatic faithfulness metrics for summarization has
produced a need for benchmarks to evaluate them. While existing benchmarks
measure the correlation with human judgements of faithfulness on
model-generated summaries, they are insufficient for diagnosing whether metrics
are: 1) consistent, i.e., indicate lower faithfulness as errors are introduced
into a summary, 2) effective on human-written texts, and 3) sensitive to
different error types (as summaries can contain multiple errors). To address
these needs, we present a benchmark of unfaithful minimal pairs (BUMP), a
dataset of 889 human-written, minimally different summary pairs, where a single
error is introduced to a summary from the CNN/DailyMail dataset to produce an
unfaithful summary. We find BUMP complements existing benchmarks in a number of
ways: 1) the summaries in BUMP are harder to discriminate and less probable
under SOTA summarization models, 2) unlike non-pair-based datasets, BUMP can be
used to measure the consistency of metrics, and reveals that the most
discriminative metrics tend not to be the most consistent, and 3) unlike
datasets containing generated summaries with multiple errors, BUMP enables the
measurement of metrics' performance on individual error types.
Comment: Accepted as a long main conference paper at ACL 2023
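As a concrete illustration of the pair-based consistency measurement BUMP enables, the sketch below checks whether a metric scores each faithful summary above its minimally edited unfaithful counterpart. The token-overlap metric and the example pair are hypothetical stand-ins, not a real baseline or an actual BUMP instance.

```python
# Minimal sketch of a pair-based "consistency" check: a metric is
# consistent on a pair if it scores the faithful summary above its
# minimally edited unfaithful counterpart. The overlap metric below
# is a toy stand-in for a real faithfulness metric.
from typing import Callable, List, Tuple

def overlap_metric(source: str, summary: str) -> float:
    src, summ = set(source.lower().split()), set(summary.lower().split())
    return len(src & summ) / max(len(summ), 1)

def consistency(metric: Callable[[str, str], float],
                pairs: List[Tuple[str, str, str]]) -> float:
    """Fraction of (source, faithful, unfaithful) pairs ranked correctly."""
    correct = sum(
        metric(src, good) > metric(src, bad) for src, good, bad in pairs
    )
    return correct / len(pairs)

pairs = [(
    "The storm hit Florida on Monday, damaging homes.",
    "A storm hit Florida on Monday.",
    "A storm hit Texas on Monday.",  # single entity error introduced
)]
print(consistency(overlap_metric, pairs))  # 1.0 on this toy pair
```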
Propulsion Wheel Motor for an Electric Vehicle
A wheel assembly for an electric vehicle includes a wheel rim that is concentrically disposed about a central axis. A propulsion-braking module is disposed within an interior region of the wheel rim. The propulsion-braking module rotatably supports the wheel rim for rotation about the central axis. The propulsion-braking module includes a liquid cooled electric motor having a rotor rotatable about the central axis, and a stator disposed radially inside the rotor relative to the central axis. A motor-wheel interface hub is fixedly attached to the wheel rim, and is directly attached to the rotor for rotation with the rotor. The motor-wheel interface hub directly transmits torque from the electric motor to the wheel rim at a 1:1 ratio. The propulsion-braking module includes a drum brake system having an electric motor that rotates a cam device, which actuates the brake shoes.
Active Bayesian Assessment for Black-Box Classifiers
Recent advances in machine learning have led to increased deployment of
black-box classifiers across a wide variety of applications. In many such
situations there is a critical need to both reliably assess the performance of
these pre-trained models and to perform this assessment in a label-efficient
manner (given that labels may be scarce and costly to collect). In this paper,
we introduce an active Bayesian approach for assessment of classifier
performance to satisfy the desiderata of both reliability and label-efficiency.
We begin by developing inference strategies to quantify uncertainty for common
assessment metrics such as accuracy, misclassification cost, and calibration
error. We then propose a general framework for active Bayesian assessment using
inferred uncertainty to guide efficient selection of instances for labeling,
enabling better performance assessment with fewer labels. We demonstrate
significant gains from our proposed active Bayesian approach via a series of
systematic empirical experiments assessing the performance of modern neural
classifiers (e.g., ResNet and BERT) on several standard image and text
classification datasets.
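The assessment loop lends itself to a compact illustration. Below is a minimal sketch, assuming a Beta-Bernoulli model of per-class accuracy and a posterior-variance query rule; it conveys uncertainty-guided label allocation in spirit, not the paper's exact inference strategies or selection criteria, and the class count and accuracies are toy values.

```python
# Minimal sketch (not the paper's exact procedure): maintain a Beta
# posterior over each class's accuracy and spend each label on the
# class whose accuracy estimate is currently most uncertain.
import random

random.seed(0)
K = 3                                   # number of predicted classes (toy)
alpha = [1.0] * K                       # Beta(1, 1) priors over accuracy
beta = [1.0] * K
true_acc = [0.95, 0.70, 0.80]           # hidden per-class accuracies (toy)

def posterior_var(a: float, b: float) -> float:
    return a * b / ((a + b) ** 2 * (a + b + 1))

for _ in range(100):                    # label budget
    # Query a label for the class with the most uncertain accuracy.
    k = max(range(K), key=lambda i: posterior_var(alpha[i], beta[i]))
    correct = random.random() < true_acc[k]   # simulate a human label
    alpha[k] += correct
    beta[k] += 1 - correct

for k in range(K):
    mean = alpha[k] / (alpha[k] + beta[k])
    print(f"class {k}: posterior mean accuracy = {mean:.2f}")
```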
Detecting conversation topics in primary care office visits from transcripts of patient-provider interactions.
Objective: Amid electronic health records, laboratory tests, and other technology, office-based patient and provider communication is still the heart of primary medical care. Patients typically present multiple complaints, requiring physicians to decide how to balance competing demands. How this time is allocated has implications for patient satisfaction, payments, and quality of care. We investigate the effectiveness of machine learning methods for automated annotation of medical topics in patient-provider dialog transcripts.
Materials and Methods: We used dialog transcripts from 279 primary care visits to predict talk-turn topic labels. Different machine learning models were trained to operate on single or multiple local talk-turns (logistic classifiers, support vector machines, gated recurrent units) as well as sequential models that integrate information across talk-turn sequences (conditional random fields, hidden Markov models, and hierarchical gated recurrent units).
Results: Evaluation was performed using cross-validation to measure 1) classification accuracy for talk-turns and 2) precision, recall, and F1 scores at the visit level. Experimental results showed that sequential models had higher classification accuracy at the talk-turn level and higher precision at the visit level. Independent models had higher recall scores at the visit level compared with sequential models.
Conclusions: Incorporating sequential information across talk-turns improves the accuracy of topic prediction in patient-provider dialog by smoothing out noisy information from talk-turns. Although the results are promising, more advanced prediction techniques and larger labeled datasets will likely be required to achieve prediction performance appropriate for real-world clinical applications.
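To see how sequential information can smooth noisy talk-turn predictions, consider the toy sketch below: Viterbi decoding over per-turn topic scores with a "sticky" transition prior. The topics, emission scores, and transition probabilities are invented for illustration; this is not one of the study's models.

```python
# Illustrative sketch of why sequence structure helps: Viterbi decoding
# over noisy per-turn topic scores with a "sticky" transition prior
# that smooths isolated label flips. All numbers are invented.
import math

topics = ["biomedical", "psychosocial"]
# Per-talk-turn emission probabilities from an independent classifier (toy).
emissions = [
    {"biomedical": 0.9, "psychosocial": 0.1},
    {"biomedical": 0.6, "psychosocial": 0.4},
    {"biomedical": 0.4, "psychosocial": 0.6},  # noisy outlier turn
    {"biomedical": 0.8, "psychosocial": 0.2},
]
stay, switch = 0.8, 0.2                 # sticky transition prior

def viterbi(emissions):
    scores = {t: math.log(emissions[0][t]) for t in topics}
    back = []
    for em in emissions[1:]:
        new, ptr = {}, {}
        for t in topics:
            best_prev = max(
                topics,
                key=lambda p: scores[p] + math.log(stay if p == t else switch),
            )
            new[t] = (scores[best_prev]
                      + math.log(stay if best_prev == t else switch)
                      + math.log(em[t]))
            ptr[t] = best_prev
        scores, back = new, back + [ptr]
    path = [max(topics, key=scores.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# All four turns decode to "biomedical": the noisy turn is smoothed.
print(viterbi(emissions))
```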